
Fixes for General Additive Models #743


Merged: 25 commits merged into dotnet:master on Sep 17, 2018

Conversation

@rogancarr (Contributor) commented Aug 27, 2018

This PR addresses a number of issues with the General Additive Model (GAM) trainer. In particular, it fixes the GAM Classifier, which neither fit properly nor produced a probability, and adds support for validation pruning, summary text, and centering of the feature effects around a mean response (i.e. an intercept).

Additionally, the PR addresses some minor issues in the codebase, like GAMs using copy-and-pasted versions of FastTree code, unnecessarily public attributes, and unused arguments, and it splits the BinaryClassifier and Regressor into separate files.
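For context on the centering change below: a GAM's prediction is just an intercept plus one learned lookup per feature. A minimal sketch of that scoring shape, using hypothetical names rather than the actual ML.NET types:

    using System;

    // Hypothetical sketch of GAM scoring with a centered mean response;
    // not the actual ML.NET implementation.
    internal static class GamScoringSketch
    {
        public static double Score(double[] features, double intercept,
            double[][] binEffects, Func<int, double, int> getBin)
        {
            double score = intercept;              // mean response learned in training
            for (int i = 0; i < features.Length; i++)
            {
                int bin = getBin(i, features[i]);  // map raw value to its histogram bin
                score += binEffects[i][bin];       // add the centered per-feature effect
            }
            return score;                          // for classification, this is the logit
        }
    }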

The changes are as follows:

  1. Centered the response and added an intercept term; fixed corresponding issues with table lookups, sparse calculations, and exports. (General Additive Model features are not centered after learning. #739)
  2. Added in a validation set and pruning based on the validation set, producing a final pruned graph at the end. (General Additive Models do not allow pruning with a validation set #740)
    a. Switched to using ScoreTrackers to keep track of scores during boosting.
    b. Save boost iterations as individual graphs (n_features x n_iterations x n_boosts) for pruning.
  3. Fixed GAM Classifier to use a small learning rate (FastTree Gradient of Logistic Loss prohibits small learning rates #741)
    Updated the FastTreeBinaryClassification logistic loss gradient to take the sigmoid parameter as input: the GAM Binary Classifier now uses unity, while the FastTree Binary Classifier keeps the same default of 2*learning rate (no change). Optionally, we can plumb this through to the FastTree arguments if we show an experimental gain; we can also experiment to see whether the sigmoid parameter gives gains for GAMs (i.e. by slowing learning even more). A sketch of this gradient appears after this list.
  4. Added calibration to the GamClassifier to produce probabilities (General Additive Models (GAM) for Classification learn the logit, not class probability #738).
    This meant changing the various interfaces.
  5. Refactored GAMRegressor and GAMClassifier into their own files.
    The one file for all GAM trainers had gotten a bit long, and this change is consistent with ML.NET tradition.
  6. Added tests to verify train loss and validation metrics (e.g. same value is produced by boosting and the lookup table).
  7. Added a Summary to the GAMPredictor to produce training statistics and the feature table (General Additive Model (GAM) has no summary to extract features at runtime #742).
  8. Rewrote the core split-finding routine to share FastTree's code, so that it is covered by the FastTree unit tests.
    The GAM routines used copy-and-pasted versions of internal FastTree components. To fight entropy, these were refactored to use the same central calculation, so that it is verified and validated by the FastTree unit and end-to-end tests.
  9. Removed unused arguments.
  10. Allowed for weighted samples.
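Regarding item 3, here is a hedged sketch of the sigmoid-parameterized logistic-loss gradient in its standard boosted-trees (Friedman) form, with labels in {-1, +1}; the exact FastTree expression may differ in details:

    using System;

    internal static class LogisticLossSketch
    {
        // Pseudo-response for the loss log(1 + exp(-2 * sigmoid * label * score)).
        // Per the PR: GAM passes sigmoid = 1 (unity); FastTree keeps its existing
        // default of 2 * learningRate, so FastTree behavior is unchanged.
        public static double Gradient(double label, double score, double sigmoid)
        {
            return 2 * sigmoid * label / (1 + Math.Exp(2 * sigmoid * label * score));
        }
    }

A smaller sigmoid flattens the gradient, which is why it acts as an additional brake on learning.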

Fixes #738
Fixes #739
Fixes #740
Fixes #741
Fixes #742

Rogan Carr added 20 commits August 19, 2018 15:52
…eeBinaryClassification Loss to take the sigmoid parameter as input; FastTree uses the default, stays the same. Gam uses Unity. Refactored GamRegressor and GamClassifier into their own files. Added tests to verify Train loss and validation metrics.
… 2 into the sigmoid parameter; this allows GAMs to output features on the scale of the logit.
…no graph would be defined. As the graph is still accessed, it must be defined to zero in such cases.
@dnfclas commented Aug 27, 2018

CLA assistant check
All CLA requirements met.

@rogancarr requested review from TomFinley and Zruty0 on August 27, 2018
@@ -8,6 +8,11 @@
     <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
   </PropertyGroup>

+  <ItemGroup>
+    <None Remove="GamClassification.cs" />
Contributor, commenting on None Remove="GamClassification.cs":

What is this Remove?

@@ -27,6 +32,8 @@
     <Compile Include="FastTreeRegression.cs" />
     <Compile Include="FastTree.cs" />
     <Compile Include="FastTreeTweedie.cs" />
+    <Compile Include="GamClassification.cs" />
+    <Compile Include="GamRegression.cs" />
     <Compile Include="GamTrainer.cs" />
Contributor:

Actually this is somewhat peculiar. I know you didn't do this, but why is the explicit listing in this project necessary at all? Does anything break if you get rid of this entire ItemGroup with all the <Compile Include> entries?

@TomFinley (Contributor):

Thanks for doing this @rogancarr -- should I infer from the lack of baseline changes that no tests were migrated to test GAM? I think we probably ought to do some here -- we have some standard regression problems and whatnot.

{
    using (var env = new TlcEnvironment())
    {
        var trainFile = "binary_sine_logistic_10k.tsv";
Contributor, commenting on binary_sine_logistic_10k:

Just a reminder that you need to check this file into the repo, or use build actions to download it.
(If you want to check it in, you also need CELA approval.)

Contributor:

Hi @Ivanidzo4ka, I believe that this is a synthetic dataset, so CELA approval is probably not necessary, since it does not come from any real external source? Or am I wrong?

@rogancarr (Author):

This is a synthetic dataset. If we want to add this test to the repo, then I'll add the file.

@rogancarr (Author):

But the more important question is: do we need this test?

This is really a test of the internal state of the learner that I used for debugging. If we add baseline tests for GAMs, then it seems we don't need this test, and we may not want it, since it relies on internal state.

@TomFinley (Contributor), commenting on the value False at src/Microsoft.ML.FastTree/Microsoft.ML.FastTree.csproj:7 in d0d4e6b:

I think this is the culprit that is requiring the explicit listing.

@rogancarr (Author):

Yes, my best guess is that this line:
<EnableDefaultCompileItems>False</EnableDefaultCompileItems>
is why we are explicitly listing the files to compile.
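For readers unfamiliar with this MSBuild behavior: EnableDefaultCompileItems controls whether the SDK automatically globs **/*.cs into the build. A minimal illustration of the two styles under discussion (not the actual project file):

    <!-- With default compile items disabled, every source file must be listed: -->
    <PropertyGroup>
      <EnableDefaultCompileItems>False</EnableDefaultCompileItems>
    </PropertyGroup>
    <ItemGroup>
      <Compile Include="GamClassification.cs" />
      <Compile Include="GamRegression.cs" />
      <Compile Include="GamTrainer.cs" />
    </ItemGroup>

    <!-- Dropping the property lets the SDK include **/*.cs automatically,
         making the explicit ItemGroup unnecessary. -->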

@rogancarr (Author):

New commit addressing comments on the PR:

  • Removed the strange None Remove entries in the project file
  • Updated the CSharpApi to reflect the new arguments for GAM binary classification
  • Updated the entrypoints manifest to reflect the new arguments for GAM binary classification

Next:

  • Updating baseline tests for GAM Binary Classification to reflect the new learner
  • Adding baseline tests for the old GAM Regression; confirming same results for new GAM Regression

… removed TrainingInfo (and therefore the test to validate training loss); removed the post-training scoring update; made the statistics calculator interface internal to FastTree.
@rogancarr (Author) commented Aug 31, 2018

New commit addressing offline reviews:

  • Switched the sub-graph calculation from multidimensional arrays to structs (see the sketch after this list)
  • Removed TrainingInfo (and therefore the test to validate training loss): we need a general way to do this, not a one-off solution for one learner.
  • Removed the post-training scoring update: at this point it's not necessary, and it hurts performance.
  • Made the statistics calculator interface internal to FastTree: it cannot be an abstract class.
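A hedged sketch of that data-structure change, with hypothetical names rather than the real ML.NET types: each boosting step's contribution is held as its own small struct instead of a slice of one dense [feature x iteration x bin] array, which makes pruning to the best iteration a matter of ignoring later entries.

    // Hypothetical illustration; the actual ML.NET types differ.
    internal readonly struct SubGraph
    {
        public readonly int FeatureIndex;     // feature this shape-function piece belongs to
        public readonly int Iteration;        // boosting iteration that produced it
        public readonly double[] BinEffects;  // additive effect per histogram bin

        public SubGraph(int featureIndex, int iteration, double[] binEffects)
        {
            FeatureIndex = featureIndex;
            Iteration = iteration;
            BinEffects = binEffects;
        }
    }

This layout also explains the temporary slowdown reported later in the thread: many small pieces are kept alive for pruning instead of one master graph.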

Next:

  • Run performance comparisons against the previous version for GAM{Regressor, BinaryClassifier} and FastTree{Regressor, BinaryClassifier} (in progress)
  • Recompute baselines (note that the tests will fail until this is done)

@@ -248,7 +248,7 @@ public void CopyFeatureHistogram(int subfeatureIndex, ref PerBinStats[] hist)
     double sumGTTargets = 0.0;
     double sumGTWeights = eps;
     int gtCount = 0;
-    sumWeights = 2 * eps;
+    sumWeights += 2 * eps;
     double gainShift = leafCalculator.GetLeafSplitGain(totalCount, sumTargets, sumWeights);
Contributor:

Nice
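A reading of this one-character fix (my interpretation, not text from the PR): sumWeights has already accumulated the example weights at this point, so assigning 2 * eps overwrote them and effectively discarded sample weights during split finding; adding the epsilon keeps the accumulated sum while still guarding the gain computation. In sketch form:

    // Hypothetical sketch of the surrounding logic, not the actual FastTree code.
    internal static class SplitGuardSketch
    {
        public static double GuardedSumWeights(double[] exampleWeights, double eps)
        {
            double sumWeights = 0.0;
            foreach (var w in exampleWeights)
                sumWeights += w;       // accumulate (possibly non-unit) sample weights

            // Guard against degenerate splits without discarding the accumulated sum.
            sumWeights += 2 * eps;     // the old code did: sumWeights = 2 * eps;
            return sumWeights;
        }
    }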

@rogancarr (Author):

New commits:

  • Fixed a bug introduced into the split-finding algorithm for weighted calculations
  • Updated the baseline tests for GamBinaryClassifier

I performed two performance tests and sent the results to reviewers offline:

  1. Determine if the new implementation of GAMs had an effect on goodness-of-fit (AUC, RMS) and training time.
  2. Determine if the changes to FastTree had an effect on goodness-of-fit (AUC, RMS) and training time.

Test Results:

  1. The GAM changes resulted in better goodness-of-fit for Binary Classification and equivalent goodness-of-fit for Regression. Regression was not numerically identical because the order of additions to the final graphs changed. The changes also introduced a slowdown of roughly 1.3x due to the way graphs are held for pruning (many small pieces instead of one master graph). This can be targeted for future optimization (I have a few ideas).
  2. The changes had no measurable effect on FastTree. Goodness-of-fit was numerically identical and training time was equivalent.

@rogancarr (Author) commented Sep 14, 2018

In the build, SDCA failed a test on Linux Debug -- this seems unrelated to the changes in this PR -- is it possible to rerun the tests?

Edit: I reran the tests.

{
[Argument(ArgumentType.AtMostOnce, HelpText = "Metric for pruning. (For regression, 1: L1, 2:L2; default L2)", ShortName = "pmetric")]
[TGUI(Description = "Metric for pruning. (For regression, 1: L1, 2:L2; default L2")]
public int PruningMetrics = 2;
Contributor, commenting on int PruningMetrics = 2:

Wow, really? Can we maybe have an enum here?

@rogancarr (Author):

This is a holdover from FastTree -- it's copied over from FastTreeRegression and is a parameter to the RegressionTest class.

If possible, I'd like to file this as a separate issue and fix as a separate PR.
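If that follow-up is taken, a sketch of what the enum-based argument might look like (hypothetical names; the eventual fix may differ):

    // Hypothetical replacement for the magic int; not an actual ML.NET change.
    public enum PruningMetric
    {
        L1 = 1,
        L2 = 2,
    }

    [Argument(ArgumentType.AtMostOnce, HelpText = "Metric for pruning.", ShortName = "pmetric")]
    public PruningMetric PruningMetrics = PruningMetric.L2;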

@Zruty0 (Contributor) left a comment:

:shipit:

@TomFinley (Contributor) left a comment:

Thank you @rogancarr, I'm requesting a rebuild due to an unrelated SDCA test failure.

@rogancarr (Author):

Thanks for the review, @TomFinley and @Zruty0! I addressed all but one issue and responded to that one in the comments here. Note that this PR is flagged WIP, but it is not WIP -- I had a WIP commit that I didn't remove because we will squash-merge this branch.

@TomFinley merged commit 345a5c2 into dotnet:master on Sep 17, 2018
@rogancarr deleted the roganc/gam_bug_fixes branch on April 5, 2019
@ghost locked this conversation as resolved and limited it to collaborators on Mar 29, 2022